-
Notifications
You must be signed in to change notification settings - Fork 49
Disentangle disparate 'bounds' ideas in processor #496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This starts the separation of two different ideas for boundaries in the base input: - subjectBounds: These represent the actual subject in the input string. For a `String` callee, this will cover the entire bounds, while for a `Substring` these will represent the bounds of the substring in the base. - searchBounds: These represent the current search range in the subject. These bounds can be the same as `subjectBounds` or a subrange when searching for subsequent matches or replacing only in a subrange of a string.
We should be testing the higher-level Regex.firstMatch function instead of the lower-level types directly.
str.replacing(_:with:subrange:) still needs to use separate bounds for the subject and the search range.
When we move forward while searching for the first match, the search bounds should stay the same. Only the currentPosition needs to move forward. This will allow us to implement the \G start of match anchor, with which /\Gab/ matches "abab" twice, compared with /^ab/, which only matches once.
With this change, RegexMatchesCollection keeps the subject bounds and search bounds separately, modifying the search bounds with each iteration. In addition, the replace methods that only operate on a subrange can specify that specifically, getting the correct anchor behavior while only matching within a portion of a string.
@swift-ci Please test |
@@ -20,44 +20,17 @@ struct MatchError: Error { | |||
} | |||
} | |||
|
|||
extension Executor { | |||
func _firstMatch( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test method was unused.
Tests/RegexTests/MatchTests.swift
Outdated
firstMatchTest(#"["a-c"]+"#, input: "abc", match: "a", | ||
syntax: .experimental) | ||
syntax: .experimental, xfail: true) | ||
firstMatchTest(#"["abc"]+"#, input: "cba", match: "cba", | ||
syntax: .experimental) | ||
firstMatchTest(#"["abc"]+"#, input: #""abc""#, match: "abc", | ||
syntax: .experimental) | ||
syntax: .experimental, xfail: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what the right behavior is for these experimental ones…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same as the above, but we can also remove them. The quotes are just prettier \Q...\E
Tests/RegexTests/MatchTests.swift
Outdated
var executor = try _compileRegex(regex, syntax) | ||
executor.engine.enableTracing = enableTracing | ||
let (str, caps) = try executor._firstMatch( | ||
regex, input: input, enableTracing: enableTracing) | ||
let capStrs = caps.map { $0 == nil ? nil : String($0!) } | ||
return (String(str), capStrs) | ||
let regex = try Regex(regexStr) | ||
guard let result = try regex.firstMatch(in: input) else { | ||
throw MatchError("match not found for \(regexStr) in \(input)") | ||
} | ||
let caps = result.output.slices(from: input) | ||
return (String(input[result.range]), caps.map { $0.map(String.init) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched this to testing the actual public API instead of the executor.
@@ -411,9 +426,9 @@ extension Processor { | |||
|
|||
case .consumeBy: | |||
let reg = payload.consumer | |||
guard currentPosition < bounds.upperBound, | |||
guard currentPosition < subjectBounds.upperBound, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subject or search bounds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we, generally, know which one to choose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, we want to be moving and matching with searchBounds
; subjectBounds
should really only be used to match anchors, not for general purpose index movements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(i.e. this was incorrect)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any way we could use access control or some better coding convention here? E.g. if we always want search bounds except for anchors, can we isolate the code that refers to subject bounds inside the engine?
Tests/RegexTests/MatchTests.swift
Outdated
firstMatchTest(#"["a-c"]+"#, input: "abc", match: "a", | ||
syntax: .experimental) | ||
syntax: .experimental, xfail: true) | ||
firstMatchTest(#"["abc"]+"#, input: "cba", match: "cba", | ||
syntax: .experimental) | ||
firstMatchTest(#"["abc"]+"#, input: #""abc""#, match: "abc", | ||
syntax: .experimental) | ||
syntax: .experimental, xfail: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same as the above, but we can also remove them. The quotes are just prettier \Q...\E
This still isn't quite right - Processor.consume() should really return the matched range, since it already tracks the starting point of matching. Having it return the whole range will also prepare us for supporting `\K`, which will mean `currentPosition` isn't always the start of the matched range.
This at least clarifies which boundaries are being used.
@swift-ci Please test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some minor cleanups and testing requests.
These anchors should always match at start/end of line, not just when .anchorsMatchLineEndings() is turned on.
@swift-ci Please test |
This separates the two different ideas for boundaries in the base input: - subjectBounds: These represent the actual subject in the input string. For a `String` callee, this will cover the entire bounds, while for a `Substring` these will represent the bounds of the substring in the base. - searchBounds: These represent the current search range in the subject. These bounds can be the same as `subjectBounds` or a subrange when searching for subsequent matches or replacing only in a subrange of a string. * firstMatch shouldn't update searchBounds on iteration When we move forward while searching for the first match, the search bounds should stay the same. Only the currentPosition needs to move forward. This will allow us to implement the \G start of match anchor, with which /\Gab/ matches "abab" twice, compared with /^ab/, which only matches once. * Make matches(of:) and ranges(of:) boundary-aware With this change, RegexMatchesCollection keeps the subject bounds and search bounds separately, modifying the search bounds with each iteration. In addition, the replace methods that only operate on a subrange can specify that specifically, getting the correct anchor behavior while only matching within a portion of a string.
This starts the separation of two different ideas for boundaries in the base input:
String
callee, this will cover the entire bounds, while for aSubstring
these will represent the bounds of the substring in the base.subjectBounds
or a subrange when searching for subsequent matches or replacing only in a subrange of a string.